Skip to content

Add Spark interop test for reading Puffin deletion vectors#3476

Open
moomindani wants to merge 1 commit into
apache:mainfrom
moomindani:moomindani/puffin-dv-reader-interop-test
Open

Add Spark interop test for reading Puffin deletion vectors#3476
moomindani wants to merge 1 commit into
apache:mainfrom
moomindani:moomindani/puffin-dv-reader-interop-test

Conversation

@moomindani

Copy link
Copy Markdown

Rationale for this change

Extracted from #3474 per review feedback (#3474 (comment)): this integration test verifies that PyIceberg can read Puffin deletion vectors written by Spark, which holds independently of the PuffinWriter changes in that PR.

Are these changes tested?

This PR is test-only. The test passed CI as part of #3474 (integration-test job) before being extracted.

Are there any user-facing changes?

No.

@ebyhr ebyhr left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me except for comments.

Comment on lines +25 to +27
def run_spark_commands(spark: SparkSession, sqls: list[str]) -> None:
for sql in sqls:
spark.sql(sql)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not certain if we truly require this method, as calling the sql method is simpler.

spark.sql(f"DROP TABLE IF EXISTS {identifier}")
vs
run_spark_commands(spark, [f"DROP TABLE IF EXISTS {identifier}"])

No requested change. I know there's already a similar method in other classes.



@pytest.mark.integration
def test_read_spark_written_puffin_dv(spark: SparkSession, session_catalog: RestCatalog) -> None:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about moving this test to test_deletes.py?

Comment on lines +64 to +68
assert len(delete_manifests) > 0, "Expected delete manifest with DVs"

delete_manifest = delete_manifests[0]
entries = list(delete_manifest.fetch_manifest_entry(table.io))
assert len(entries) > 0, "Expected at least one delete file entry"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason we avoid using the ... == 1 condition?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants